(54) Adaptive web crawling using a statistical model

(57) A computer-based system and method of retrieving
information pertaining to documents on a computer
network is disclosed. The method includes selecting
a set of documents to be accessed during a Web
crawl by utilizing a statistical model to determine which
previously retrieved documents are most likely to have
changed since last accessed. The statistical model
continuously improves its accuracy by training internal
probability distributions to reflect the actual experience
with change rate patterns of the documents accessed.
The decision whether to access a document is
based on the probability of change compared against a
desired synchronization level, random selections, maximum
limits on the amount of time since the document
was last accessed, and other criteria. Once the decision
to access is made, the document is checked for
changes and this information is used to train the statistical
model.
Description

Field of the Invention 



[0001] The present invention relates to the field of network information software and, in particular, to methods and
systems for retrieving data from network sites.

Background of the Invention 

[0002] [...] a user [...] sending [...] Simple Mail Transfer Protocol (SMTP), and the [...] World [...]

[0003] The HTTP [...] World Wide Web is an information service on the Internet providing [...] maintain and distribute documents. The [...]
Wide Web is made up of numerous Web sites located around [...] referred to [...]
location of a [...]

browser application. [...] with servers and client computers operating in a manner similar [...]

Web sites while conducting a Web crawl. [...]
addresses that act as seeds for the crawl and a set of [...] restrictions that define the scope of the crawl. The
Web crawler recursively gathers network addresses [...] in the documents retrieved during
the crawl. The Web crawler retrieves the document from a Web [...] retrieved document data from the [...]. For example, a Web crawler may [...]

crawlers can visit only a small fraction of the documents [...]
Web will change over time with some documents changing [...] a document
published on a Web site by a news organization [...] a price list on a company's Web
site may change once a year, and a document on a personal Web site may never change. Without regard to the [...] synchronization with the [...] retrieved

document based in part on the probability that the [...]
was last accessed. Preferably, such [...] to access a Web document
without having to establish a connection with a host server that [...] of the document. The mechanism
would also preferably provide a [...]
documents based on the [...] changed documents encountered during
[...] document, the mechanism should provide a way to
[...] indeed changed. The present invention is directed to providing
such a mechanism.
Summary of the Invention

[0009] In accordance with one aspect of the invention, computer-based methods and systems for retrieving data 
from a computer network are provided. The methods and systems of the present invention optimize a Web crawler's
use of computer resources when performing adaptive incremental Web crawls to maintain the synchronization between
local data copied from a document when it was previously retrieved and current data contained in the document which 
may have been changed since the document was last retrieved. To intelligently determine which documents are most 
likely to have changed since a previous retrieval, the methods and systems of the present invention adaptively decide 
on whether or not to access a previously retrieved document during a current Web crawl based in part on a statistical
model.

[0010] In accordance with other aspects of the invention, each Web crawl begins with an active probability distribution
containing a plurality of probabilities indicative that a document has changed at a given change rate. A history map is 
maintained by the Web crawler that references a number of documents that were accessed during previous Web 
crawls. For each referenced document in the history map, a document probability distribution is initialized as a copy
of the active probability distribution. The document probability distribution is trained under a statistical model. The
training is based on changes to the document experienced by the Web crawler during the previous Web crawls. A 
probability that the document has changed during an interval of interest is then computed based on the document 
probability distribution and the statistical model. A decision to access or not to access the document is made with the 
aid of this computed probability. 
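
The training and probability computation described above can be sketched as follows. This is a minimal illustration, not the patented model: it assumes (the text here does not say so) that changes follow a Poisson process, that a distribution is a dictionary mapping a candidate change rate to its probability, and that training is a Bayesian update. The names `train` and `change_probability` and the example rates are hypothetical.

```python
import math

def train(dist, interval, changed):
    """Update {change_rate: probability} for one observed event (assumed Bayesian)."""
    posterior = {}
    for rate, p in dist.items():
        p_no_change = math.exp(-rate * interval)  # Poisson assumption
        likelihood = (1.0 - p_no_change) if changed else p_no_change
        posterior[rate] = p * likelihood
    total = sum(posterior.values())
    if total == 0.0:
        return dict(dist)  # degenerate evidence: keep the prior
    return {rate: p / total for rate, p in posterior.items()}

def change_probability(dist, interval):
    """Probability that the document changed during the interval of interest."""
    return sum(p * (1.0 - math.exp(-rate * interval)) for rate, p in dist.items())

# Hypothetical active distribution: changes per day -> probability.
active = {0.01: 0.5, 0.1: 0.3, 1.0: 0.2}

# The document distribution starts as a copy of the active distribution
# and is trained on the events seen in previous crawls.
doc = dict(active)
for interval, changed in [(1.0, True), (1.0, True), (2.0, False)]:
    doc = train(doc, interval, changed)

p_changed = change_probability(doc, 1.0)
```

Under these assumptions, each change event shifts weight toward the faster candidate rates and each no change event shifts it toward the slower ones.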

[0011] In accordance with additional aspects of the invention, the document probability distribution is trained for
events as experienced with the document upon previous accesses. These events may include "change events" or "no 
change events." A change event may be where the document was found to have changed in some substantive manner 
since the last access of the document. A no change event may be where an access to the document determines that 
the document has not changed. A no change event determination may be made in many ways, such as by evaluating
a time stamp associated with the document, or by finding no substantive change because a hash value of the currently
retrieved document matches a hash value of the previously retrieved document. Events such as "no change chunk
events" may also be interpolated from experienced events, as is described in detail below. 

[0012] The probability that the document has changed (the "document change probability") is computed based on 
the document probability distribution. A bias is then computed based on the document change probability in conjunction
with a synchronization level. The synchronization level may be a predefined value that specifies the percentage of
documents that are expected to be synchronized at any given time. A decision whether to access the document is 
made based on a "coin-flip" using the computed bias. 
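
A sketch of the "coin-flip" decision follows. The text does not give the formula that turns the document change probability and the synchronization level into a bias, so the formula in `access_bias` is purely an illustrative assumption: the allowed stale fraction is taken to be one minus the synchronization level, and the bias is the change probability scaled by that allowance. Both function names are hypothetical.

```python
import random

def access_bias(p_change, sync_level):
    """Hypothetical bias: change probability relative to the allowed stale fraction."""
    stale_budget = 1.0 - sync_level
    if stale_budget <= 0.0:
        return 1.0  # full synchronization requested: always access
    return min(1.0, p_change / stale_budget)

def should_access(p_change, sync_level, rng=random):
    """Flip a biased coin to decide whether to access the document."""
    return rng.random() < access_bias(p_change, sync_level)

# Example: a 5% change probability against a 90% synchronization level.
decision = should_access(0.05, 0.9, random.Random(0))
```

A seeded `random.Random` instance can be passed as `rng` to make a crawl's decisions reproducible in tests.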

[0013] In accordance with further aspects of the invention, the methods and systems of the present invention conserve
computer resources by balancing the need for accuracy in the statistical model against the computer storage
and computing resources available. In an actual embodiment of the invention, a minimal amount of historical information
is maintained for each document in a history map. This historical information is converted by the method and systems 
of the present invention to interpolate change events, no change events, and no change chunk events by mapping 
data recorded in the history map to a timeline. From the interpolation, the variables required by the statistical model 
can be determined with reasonable accuracy, given the limited resources available to the Web crawler and the need
for speedy processing when conducting a Web crawl.

[0014] In accordance with still further aspects of the invention, at the start of each adaptive incremental crawl a 
training probability distribution is initialized to essentially zero by multiplying a copy of a base probability distribution 
(containing a starting point estimate of probabilities that a document will change at a given change rate) by a small 
diversity factor. The training probability distribution recursively accumulates the document probability distribution for
each document processed during the Web crawl. By summing each probability in the training probability distribution
with a corresponding probability from each document probability distribution, the training probability distribution repre- 
sents the accumulated experience-trained document probability distributions for all documents processed to that point 
in the current crawl. At the end of the current crawl, the training probability distribution is stored and used as the active 
probability distribution for the next crawl. 
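
The accumulation of the training probability distribution can be sketched as below. The dictionary representation, the value of `diversity_factor`, and the final normalization in `finish` are assumptions made for illustration; the text states only that the base distribution is scaled by a small factor and that document distributions are summed into it.

```python
def start_training(base, diversity_factor=1e-6):
    """Initialize the training distribution to essentially zero."""
    return {rate: p * diversity_factor for rate, p in base.items()}

def accumulate(training, doc_dist):
    """Add one trained document distribution into the training distribution."""
    for rate, p in doc_dist.items():
        training[rate] = training.get(rate, 0.0) + p

def finish(training):
    """Normalize so the result can act as the next crawl's active distribution (assumed step)."""
    total = sum(training.values())
    return {rate: p / total for rate, p in training.items()}

# Hypothetical base distribution and two trained document distributions.
base = {0.1: 0.5, 1.0: 0.5}
training = start_training(base)
accumulate(training, {0.1: 0.9, 1.0: 0.1})
accumulate(training, {0.1: 0.7, 1.0: 0.3})
next_active = finish(training)
```

Because the scaled base distribution is never exactly zero, every candidate change rate keeps a small nonzero weight even if no document favored it during the crawl.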

[0015] In accordance with other aspects of this invention, once the decision is made to access the document, a
document address specification for that document is added to a transaction log. To process the transaction log, the 
Web crawler first retrieves a time stamp for the document from the location specified by the document address specification.
That time stamp is compared with a time stamp associated with the version of the document previously retrieved
(stored locally). If the respective time stamps match, the current document is considered to be unchanged, and
is therefore not retrieved during the current Web crawl. Preferably, the time stamp comparison is performed by sending
a request to a server to transfer the document only if the time stamp associated with the document at the server is
more recent than a time stamp included in the request.
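
For documents served over HTTP, one way to send such a request is a conditional GET carrying an `If-Modified-Since` header. The text does not name this header, so treating it as the mechanism is an assumption. The sketch below uses only the Python standard library; the function names and the example URL are hypothetical.

```python
import email.utils
import urllib.error
import urllib.request

def build_conditional_request(url, stored_time_stamp):
    """Build a GET request asking for the document only if it is newer."""
    request = urllib.request.Request(url)
    request.add_header(
        "If-Modified-Since",
        email.utils.formatdate(stored_time_stamp, usegmt=True),
    )
    return request

def fetch_if_changed(url, stored_time_stamp):
    """Return the document bytes, or None when the server reports no change."""
    request = build_conditional_request(url, stored_time_stamp)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # 304 Not Modified: time stamps match
            return None
        raise

# Build (but do not send) a request for a hypothetical document URL.
example = build_conditional_request("http://example.com/doc.html", 0)
```

With this approach the comparison happens at the server, so an unchanged document costs one round trip and no document transfer.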

[0016] In accordance with other aspects of this invention, a secure hash function is used to determine a hash value
corresponding to each previously retrieved document [...] subsequent Web crawls to determine whether the [...] the previously retrieved document
[...] may be used to obtain a new hash value, which is compared [...]. If the hash values are equal, the [...] considered to be [...] substantively equivalent to the previously
retrieved document data. If the hash values differ, the [...] considered to be modified and a change
counter is incremented for the document. An [...] incremented each time a network access is
attempted on the current document, such as when the current [...] time stamp is requested.

[0017] In accordance with further [...] includes assigning a unique
current crawl number to the Web crawl, and determining whether a [...] retrieved document corresponding to each
previously retrieved document copy is [...] previously retrieved document
copy, in order to determine whether the document has been [...] has been modified, the document's [...]

[...] with the invention for retrieving data from previously [...] been retrieved is minimized.
The invention allows a Web crawler to [...] more comprehensive crawls.
Assigning a crawl number modified to [...] crawl number when the document [...] since the last time it was retrieved by [...]
time.

Brief Description of the Drawings

[0019] [...] and many of the attendant advantages of this invention will become more readily [...]
conjunction with the accompanying drawings, wherein:

FIGURE 1 is a block diagram illustrating some of the components used in the invention [...]

FIGURES [...]A-B are a functional flow diagram illustrating a process of the present invention for defining [...]

FIGURES [...] are a functional flow diagram illustrating a process [...] probability [...]

FIGURE 17 is a block diagram illustrating the process of accumulating a training probability distribution in accordance
with the present invention; and

FIGURE 18 is a flow diagram illustrating the process of performing a search for documents in accordance with
the present invention.

Detailed Description of the Preferred Embodiment 


[0020] The present invention is a mechanism for obtaining information pertaining to documents that reside on one 
or more server computers. While the following discussion describes an actual embodiment of the invention that crawls 
the World Wide Web within the Internet, the present invention is not limited to that use. This present invention may 
also be employed on any type of computer network or individual computer having data stores such as file systems,
e-mail messages and databases. The information from all of these different data stores can be processed by the
invention together or separately. The present invention may also be used in any context in which it is desirable to 
maintain the synchronization of previously retrieved data with data as it may have been changed at its source. In 
addition to the application of the present invention in the Web crawler discussed below, another useful application of 
the present invention would be in a proxy server that stores local copies of documents that need to be "refreshed" at
the proxy server when a source document changes.

[0021] A server computer hosts one or more Web sites and the process of locating and retrieving digital data from 
Web sites is referred to as "Web crawling." The mechanism of the invention initially performs a first full crawl wherein 
a transaction log is "seeded" with one or more document address specifications. A current document at each document 
address specification listed in the transaction log is retrieved from its Web site and processed. The processing includes
extracting document data from each of these retrieved current documents and storing that document data in an index,
or other database, with an associated crawl number modified that is set equal to a unique current crawl number that 
is associated with the first full crawl. A hash value for the document and the document's time stamp are also stored 
with the document data in the index. The document URL, its hash value, its time stamp, its crawl number modified and 
other historical information (discussed below) are stored in a persistent history map that is used by the crawler to record
the documents that it has crawled.

[0022] Subsequent to the first full crawl, the invention can perform any number of full crawls or incremental crawls. 
During a full crawl, the transaction log is "seeded" with one or more document address specifications, which are used 
to retrieve the document associated with the document address specification. The retrieved documents are recursively 
processed to find any "linked" document address specifications contained in the retrieved document. The document
address specification of the linked document is added to the transaction log the first time it is found during the current
crawl. The full crawl builds a new index based on the documents that it retrieves based on the "seeds" in its transaction 
log and the project gathering rules that constrain the search. During the course of the full crawl, the document address 
specifications of the documents that are retrieved are compared to associated entries in the history map (if there is an 
entry), and a crawl number modified is assigned as is discussed in detail below. 
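
The recursive gathering of a full crawl can be sketched with an in-memory queue standing in for the transaction log. This is a simplified illustration: the `fetch_links` callback, the toy `web` dictionary, and the omission of gathering rules, indexing, and the history map comparison are all assumptions made to keep the example short.

```python
from collections import deque

def full_crawl(seeds, fetch_links):
    """Visit every document reachable from the seeds, each exactly once."""
    transaction_log = deque(seeds)  # documents still to be processed
    seen = set(seeds)               # addresses already logged in this crawl
    visited = []
    while transaction_log:
        url = transaction_log.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:    # add a link only the first time it is found
                seen.add(link)
                transaction_log.append(link)
    return visited

# Hypothetical three-document Web: address -> linked addresses.
web = {"a": ["b", "c"], "b": ["c", "a"], "c": []}
visited = full_crawl(["a"], lambda url: web.get(url, []))
```

The `seen` set is what keeps cyclic links, such as the link from "b" back to "a", from being processed twice.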

[0023] An adaptive incremental crawl retrieves only documents that may have changed since the previous crawl.
The adaptive incremental crawl uses the existing index and history map. The transaction log is selectively seeded with 
the document address specifications based on a decision whether or not to access a previously retrieved document 
that is made utilizing a statistical model, random selection and a selection based on the amount of time since the last 
access of the document. In an adaptive incremental crawl, once a decision is made to access a previously retrieved
document, the document data is retrieved from a Web site if its time stamp is subsequent to the time stamp stored in
the Web crawler's history map. In other words, during an adaptive incremental crawl, a document is preferably only
retrieved from a Web site following an access to determine if the time stamp on the document on the Web site is
different than the time stamp that was recorded in the history map for that URL. If the time stamp differs or is unavailable,
the document is retrieved from the Web server.

[0024] When the document data is retrieved, the invention determines if an actual substantive change has been
made to the previously retrieved document. This is done by filtering extraneous data from the document data (e.g., 
formatting information) and then computing a hash value for the retrieved document data. This newly computed hash 
value is then compared against the hash value stored in the history map for previously retrieved document data. Different
hash values indicate that the content of the previously retrieved document has changed, resulting in the crawl
number modified stored with the document data being reset to the current crawl number assigned to the Web crawl
and a document change counter being incremented for that document in its associated history map entry.
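
The substantive-change check can be sketched as follows. The filtering rule (strip anything that looks like a tag, then collapse whitespace), the dictionary keys of the history entry, and the function names are assumptions for illustration; the text does not specify how extraneous data is filtered. MD5 is used because the description names it elsewhere, although a current implementation might prefer `hashlib.sha256`.

```python
import hashlib
import re

def content_hash(html):
    """Hash the document text after removing formatting (assumed filter)."""
    text = re.sub(r"<[^>]*>", " ", html)  # drop anything that looks like a tag
    text = " ".join(text.split())          # collapse runs of whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def record_retrieval(entry, html, current_crawl):
    """Update a history entry; return True if the content changed."""
    new_hash = content_hash(html)
    changed = new_hash != entry.get("hash")
    if changed:
        entry["hash"] = new_hash
        entry["crawl_number_modified"] = current_crawl
        entry["change_count"] = entry.get("change_count", 0) + 1
    return changed

# Hypothetical history entry for a price list first seen in crawl 1.
entry = {"hash": content_hash("<p>Price: 10</p>"),
         "crawl_number_modified": 1, "change_count": 0}
same = record_retrieval(entry, "<div><b>Price:</b>   10</div>", 2)
different = record_retrieval(entry, "<p>Price: 12</p>", 3)
```

In this example the second retrieval differs only in markup and spacing, so it hashes to the same value and leaves the entry untouched, while the third retrieval changes the price and is recorded as a modification in crawl 3.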
[0025] Searches of the database created by the Web crawler can use the crawl number modified as a search pa- 
rameter if a user is only interested in documents that have changed, or that have been added, since a previous search. 
Since the invention only changes the crawl number modified associated with the document when it is first retrieved,
or when it has been retrieved and found to be modified, the user can search for only modified documents. In response
to this request, the intermediate agent implicitly adds a limitation to the search that the search return only documents 
that have a crawl number modified that is subsequent to a stored crawl number associated with a prior search. 
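
The implicit search limitation can be sketched as a filter over index records. The record layout, the field name `crawl_number_modified`, and the function name are hypothetical; only the comparison against a stored crawl number comes from the text.

```python
def modified_since(index, last_search_crawl):
    """Return documents first retrieved or modified after the given crawl."""
    return [doc for doc in index
            if doc["crawl_number_modified"] > last_search_crawl]

# Hypothetical index records.
index = [
    {"url": "a", "crawl_number_modified": 1},
    {"url": "b", "crawl_number_modified": 3},
    {"url": "c", "crawl_number_modified": 4},
]
recent = modified_since(index, 2)
```

Here a prior search associated with crawl 2 would see only documents "b" and "c" on the next modified-only search.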
[0026] Web crawler programs execute on a computer, preferably a general purpose personal computer. FIGURE 1
[...] in which the invention may be implemented. [...]
of computer-executable instructions, such as program modules, being [...] by a personal computer. Generally,
program modules include routines, programs, objects [...] that [...] perform particular tasks
or implement particular abstract data types. Moreover, those skilled in the art will [...] that the invention may be
practiced with other computer system configurations, including [...] microprocessor-based
or programmable consumer [...] are performed by remote [...]

[0027] [...] computing device in the form of a conventional personal [...]
22, and a system bus 23 that couples various system [...] to the processing unit
21. The system bus 23 may be any of several [...] bus or memory controller,
a peripheral bus, and a local bus using any [...]. [...] includes read only
memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic
routines that help to transfer information between elements within the personal [...],
is stored in ROM 24. The [...]
hard disk, not shown, a magnetic disk drive 28 for reading [...] magnetic disk 29, and an
optical disk drive 30 for reading from or writing to a removable [...] or other optical media.
The hard disk drive 27, magnetic disk drive 28 and optical [...]
disk drive interface 32, a magnetic disk drive interface 33 and an [...] 34, respectively. The drives
and their associated [...] readable instructions, data
structures, program modules and other data for the personal [...]. [...] exemplary environment described
herein employs a hard disk, a removable [...] removable optical disk 31, it should be
appreciated by those skilled in the art that [...] which can store data that is [...]
disks, Bernoulli cartridges [...]

[0028] A number of program [...]
RAM 25, including an operating system 35, one or more [...]
program data 38. A user may enter commands and information [...] computer 20 through input devices
such as a keyboard 40 and [...] microphone, joystick,
game pad, satellite dish, scanner, or the like. [...] connected to the processing unit
21 through a serial port interface 46 that is coupled to the [...], but may be connected by other interfaces, such [...]
display device is also [...]

[0029] [...] computer, a server, a router, a network PC, a peer device [...]
or all the elements [...] although only a memory storage device
50 [...] has been illustrated in FIGURE 1. The [...]
(LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide
computer networks, intranets and the [...]. [...] computer 60 communicates
with the personal computer [...]
computer 20 via the wide area network 52. [...]

[0030] When used in a local area networking [...], the personal computer 20 is connected to the local network
51 through a network interface or adapter 53. When used in a wide [...]
20 typically includes a modem 54 or other means for [...]
modem 54, which may be [...] via the serial port interface 46. In a [...]
the personal computer 20, or portions thereof, may be [...]

[0031] FIGURE 2 illustrates an exemplary architecture [...] Web crawler program 206 [...]
remote server computer 218 depicted in FIGURE 2. The computer network 216 may be a local area network 51 (FIGURE
1), a wide area network 52, or a combination of networks that allow the server computer 204 to communicate
with remote computers, such as the remote server computer 218, either directly or indirectly. The server computer 204
and the remote server computer 218 are preferably similar to the personal computer 20 depicted in FIGURE 1 and
discussed above.

[0032] The Web crawler program 206 searches ("crawls") remote server computers 218 connected to the network 
216 for documents 222 and 224. The Web crawler 206 retrieves documents as document data. The document data 
from the documents 222 and 224 can be used in a variety of ways. For example, the Web crawler 206 may pass the 
document data to an indexing engine 208. An indexing engine 208 is a computer program that maintains an index 210
of documents. The type of information stored in the index depends upon the complexity of the indexing engine.

[0033] A client computer 214, such as the personal computer 20 (FIGURE 1), is connected to the server computer 
204 by a computer network 212. The computer network 212 may be a local area network, a wide area network, or a 
combination of networks. The computer network 212 may be the same network as the computer network 216 or a 
different network. The client computer 214 includes a computer program, such as a "browser" 215 that locates and
displays documents to a user.

[0034] When a user at the client computer 214 desires to search for one or more documents, the client computer 
transmits a search request to a search engine 230. The search engine 230 examines its associated index 210 to find
documents that may relate to the search request. The search engine 230 may then return a list of those documents
to the browser 215 at the client computer 214. The user can examine the list of documents and retrieve one or more
from remote computers such as the remote server computer 218.

[0035] As will be readily understood by those skilled in the art of computer network systems, and others, the system 
illustrated in FIGURE 2 is exemplary, and alternative configurations may also be used in accordance with the invention. 
For example, the server computer 204 itself may include documents 232 and 234 that are accessed by the Web crawler 
program 206. Also the Web crawler program 206, the indexing engine 208, and the search engine 230 may reside on
different computers. Additionally, the Web browser program and the Web crawler program 206 may reside on a single
computer. Further, the indexing engine 208 and search engine 230 are not required by the present invention. The Web 
crawler program 206 may retrieve document information for use other than providing the information to a search engine. 
As discussed above, the client computer 214, the server computer 204, and the remote server computer 218 may 
communicate through any type of communication network or communications medium. 

[0036] FIGURE 3 illustrates, in further detail, the Web crawler program 206 and related software executing on the
server computer 204 (FIGURE 2). As illustrated in FIGURE 3, the Web crawler program 206 includes a "gatherer" 
process 304 that crawls the Web and gathers information pertaining to documents. The gatherer process 304 is invoked 
by passing it one or more starting document address specifications, e.g., URLs 306. The starting URLs 306 serve as 
seeds, instructing the gatherer process 304 where to begin its Web crawling process. A starting URL can be a universal
naming convention (UNC) directory, a UNC path to a file, or an HTTP path to a URL. The gatherer process 304 inserts
the starting URLs 306 into a transaction log 310. The transaction log 310 identifies those documents that are to be
crawled during the current crawl. Preferably, the transaction log 310 is implemented as a persistent queue that is written
and kept in a nonvolatile storage device such as a disk 27. Preferably, the Web crawler 206 maintains a small in- 
memory cache of transactions in the transaction log 310 for quick access to the next transactions. 

[0037] The gatherer process 304 also maintains a history map 308, which contains an ongoing list of all URLs and
other historical information that have been accessed during the current Web crawl and previous crawls. The gatherer 
process 304 includes one or more worker threads 312 that process a URL until all the URLs in the transaction log 310 
have been processed. The worker thread 312 retrieves a URL from the transaction log 310 and passes the URL to a 
filter daemon 314. The filter daemon 314 is a process that retrieves document data from the previously retrieved document
at the address specified by the URL. The filter daemon 314 uses the access method specified by the URL to
retrieve the document. The access method may be any file access method capable of allowing the filter daemon 314 
to retrieve data, such as HTTP, File Transfer Protocol (FTP), file system commands associated with an operating 
system, or any other access protocol. 

[0038] After retrieving a document, the filter daemon 314 parses the document and returns a list of text and properties.
For example, an HTML document includes a sequence of properties or "tags," each containing some information. The
information may be text to be displayed, "metadata" that describes the formatting of the text, hyperlinks, or other information.
A hyperlink typically includes a document address specification. The Web browser program 215 uses the
hyperlink to retrieve the information at the location in the document address specification. The information may be
another document, a graphical image, an audio file, or the like.

[0039] Tags may also contain information intended for a search engine. For example, a tag may include a subject
or category within which the document falls, to assist search engines that perform searches by subject or category. 
The information contained in tags is referred to as "properties" of the document. A document is therefore considered 
to be made up of a set of properties and text. The filter daemon 314 returns the list of properties and text to the worker
thread [...]

[0040] The list of properties for a document includes a list of URLs that are [...]
The worker thread 312 passes this list of URLs to the history map 308. The history map 308 is illustrated in FIGURE
4 and discussed below. Briefly stated, when a new or modified document is retrieved, the history map 308 checks each
[...] the history map 308 are added
[...] current crawl. Use of the history map 308 allows the Web crawler
206 to avoid processing the same URL more than once during a crawl. The URLs that are not already listed on the
history map 308 are also added to the transaction log 310, to be subsequently processed by a worker thread.

[0041] The worker thread 312 then passes the list of properties and text to the indexing engine 208. The indexing
engine 208 creates an index 210 which is used by the search engine 230 in subsequent searches.

[0042] FIGURE 4 illustrates an exemplary history map 308 in accordance with the present invention. Preferably, the [...]
As depicted, the history map 308 includes multiple entries 410, one entry corresponding to each URL 412. Each URL
[...] corresponding document when the Web crawler last retrieved the document is stored in the [...]

[0043] The history map also includes a hash value 416 corresponding to each document identified in the history
map. A hash value results from applying a "hash function" to the document. A hash function is [...]
that transforms a digital document into a smaller representation of the document (called a "hash value"). A "secure
hash function" is a hash function that is designed so that it is computationally unfeasible to find two different documents
that "hash" to produce identical hash values. A hash value produced by a secure hash function serves as a "digital
fingerprint" of the document. The "MD5" is one such secure hash function, published by RSA Laboratories of Redwood
[...] suitable for use in conjunction with the present invention.

[...] 418. The crawl number crawled 418 [...]
during which the corresponding URL was processed. As discussed below, the crawl number
crawled 418 prevents duplicate processing of URLs during a crawl. When a crawl is completed, the
crawl [...]

most recent crawl number during which the corresponding document was determined to be modified. Unlike the crawl
number crawled 418, the crawl number modified 420 is only set to the current crawl number when the document is [...]

[...] in a statistical model for deciding if a document should be accessed during an adaptive
incremental crawl, as is discussed below with reference to FIGURE 8. The first access time 422 is set when the document
is first accessed; the last access time 424 is set to the most recent time that the document was accessed; the
change count 426 is a counter that is incremented each time the document is discovered to have changed in a
substantive [...]; and the access count 428 is a counter that is incremented each time an access is attempted for the [...]



[0047] An exemplary transaction log 310 is shown in FIGURE 5. The transaction log 310 contains entries 510 that
each represent a document to visit during the Web crawl. In an actual embodiment of the invention, each entry 510 in
the transaction log 310 [...] a status data 514 that is marked when
[...] any errors encountered during processing, a user [...]

During processing, as new URLs are gathered from documents associated with the seeded entries, the new URLs are [...]

[...] while using [...] history map 308. An adaptive incremental crawl updates the existing
index 210 as it selectively revisits the URLs contained in the existing history map 308 and checks for changes to the [...]

Once initialized as a first full crawl, a full crawl, or an adaptive [...]
method and system of the Web crawl described in FIGURES 7-9 is essentially the same for all types of Web crawls
performed by the invention.
[0050] FIGURE 6 illustrates a process performed during a first full crawl 610. At step 612, the gatherer 304 creates
a new transaction log 310 and a new history map 308, neither of which has any preexisting entries 410 or 510. The
transaction log 310 is then loaded with one or more entries 510 containing "seed" URLs 512 in step 614. The inserted 
URLs 512 are referred to as "seeds" because they act as starting points for the Web crawl. 

[0051] In step 616, corresponding entries 410 are made in the history map 308 for each of the seed entries 510
made in the transaction log 310. The history map entries 410 are initialized so that the time stamp 414, the hash value
416, the crawl number crawled 418, the crawl number modified 420, change count 426 and the access count 428 are
all set equal to zero or an equivalent "empty" or "null" value. The first access time 422 and the last access time 424
are set to "null" values. At step 618, a new index 210 is created, and the Web crawl is performed at step 620. The
operations performed during a Web crawl are detailed in FIGURE 9 and described below. Briefly described, during a
first full crawl 610, all the documents identified in the transaction log 310 are unconditionally retrieved. After the Web 
crawl, the process illustrated in FIGURE 6 is complete. 
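
The initialization in steps 612 through 616 can be sketched as follows. This is a minimal illustration only, not the actual embodiment: the class names, field names, and the function `start_first_full_crawl` are hypothetical, and the reference numerals in the comments map each field back to the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryEntry:
    """One history map entry 410 (field names are illustrative assumptions)."""
    url: str                                   # URL 412
    time_stamp: int = 0                        # time stamp 414
    hash_value: int = 0                        # hash value 416
    crawl_number_crawled: int = 0              # crawl number crawled 418
    crawl_number_modified: int = 0             # crawl number modified 420
    first_access_time: Optional[float] = None  # first access time 422 ("null")
    last_access_time: Optional[float] = None   # last access time 424 ("null")
    change_count: int = 0                      # change count 426
    access_count: int = 0                      # access count 428


@dataclass
class LogEntry:
    """One transaction log entry 510."""
    url: str                 # URL 512
    processed: bool = False  # status data 514


def start_first_full_crawl(seed_urls):
    """Create an empty log and history map, then seed both (steps 612-616)."""
    transaction_log = [LogEntry(url) for url in seed_urls]
    history_map = {url: HistoryEntry(url) for url in seed_urls}
    return transaction_log, history_map
```

A dictionary keyed by URL is used for the history map here purely for convenience of lookup; the description does not prescribe a storage layout.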

[0052] FIGURE 7 illustrates a process 710 performed during a "full crawl." The full crawl begins at step 712 by
inserting one or more seed URLs 512 into entries 510 in the transaction log 310. At step 714, the full crawl deletes the
old index and creates a new index 210. Unlike the first full crawl (FIGURE 6), the full crawl 710 opens an existing
history map 308 in step 716. The existing history map 308 is used during the processing of the entries in the transaction
log 310. In step 718, the Web crawl is performed in substantially the same manner as that illustrated in FIGURE 9 and
described below. When the Web crawl is complete, the full crawl 710 is finished.

[0053] FIGURE 8 illustrates a process 810 for performing an "adaptive incremental crawl" in accordance with the
present invention. An adaptive incremental crawl is typically performed after either a full crawl or another adaptive
incremental crawl. The purpose of an adaptive incremental crawl is to retrieve new documents or selectively retrieve
documents that have been modified since the previous crawl. The adaptive incremental crawl selectively identifies
documents that may be accessed based on a statistical model that uses the observed history of changes on previous
accesses to the document.

[0054] At step 812, the adaptive incremental crawl begins by opening an existing history map 308. Briefly described,
at step 814, base probability and rate distributions are initialized for use in the process of "seeding" the transaction log
310. The operations performed at step 814 are illustrated in detail in FIGURES 12A-C and described below.
[0055] At step 815, the transaction log 310 is adaptively seeded with URLs. The operations performed at step 815
are illustrated in detail in FIGURE 13 and described below. Briefly described, the seeding process selects entries,
based on a statistical analysis, from the history map 308 for inclusion in the transaction log 310. In this way, the
resources of the gatherer 304 may be focused on URLs corresponding to documents that are most likely to have
changed since they were last accessed.

[0056] After the transaction log is seeded, the index 210 is opened for update at step 816, and the Web crawl is
performed at step 818. Again, the Web crawl is illustrated in FIGURE 9 and described below. The process then continues
to step 820.

[0057] At step 820, a training probability distribution computed during the Web crawl at step 818 is saved to be used
as an active probability distribution for the next crawl. Training the training probability distribution is illustrated graphically
in FIGURE 17 and described below.

[0058] FIGURE 9 illustrates in detail a process performed during a Web crawl. The process begins at step 906, where
the Web crawler 206 begins retrieving and processing URLs from the transaction log 310. Specifically, at step 906, a
worker thread 312 retrieves a URL 512 from an unprocessed entry 510 in the transaction log 310. The URL is passed
to the processing illustrated in FIGURES 10A and 10B at step 908. Briefly described, at step 908, a determination is
made whether to retrieve the document identified by the URL, and if so, the document is retrieved. Each entry 510 in
the transaction log 310 is processed in this manner until it is detected in a decision step 912 that all the entries 510 in
the transaction log 310 have been processed.

[0059] Although the process 620 is discussed herein with reference to a single worker thread 312, preferably the 
mechanism of the invention may include multiple worker threads 312, each worker thread, in conjunction with other 
components, being capable of performing a Web crawl. 

[0060] FIGURES 10A and 10B illustrate in detail the processing of a URL retrieved from the transaction log 310. To
begin, at step 1002, a determination is made whether the URL 512 for the current entry in the transaction log 310 has
been processed during the current crawl. That determination is made by accessing the history map 308 to retrieve the
crawl number crawled 418 associated with an entry 410 having the same URL 412 as the current entry 510 in the
transaction log 310. If the crawl number crawled 418 for that entry matches the current crawl number, the URL has
been processed during the current crawl, and the process 908 is complete for the URL. However, if the crawl number
crawled 418 does not match the current crawl number, or if the history map 308 does not contain an entry for the URL,
the URL has not been processed during the current crawl, and processing proceeds to decision step 1003.
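
The test made at step 1002 can be sketched as a single lookup. This is an illustrative sketch under stated assumptions: the history map is modeled as a plain dictionary keyed by URL, and the function name and the key `crawl_number_crawled` are hypothetical.

```python
def processed_in_current_crawl(history_map, url, current_crawl_number):
    """Return True if the URL was already processed in the current crawl.

    history_map is assumed to map a URL to a dict that holds the
    crawl number crawled 418 under the key "crawl_number_crawled".
    """
    entry = history_map.get(url)
    if entry is None:
        # No history map entry: the URL has not been processed.
        return False
    return entry["crawl_number_crawled"] == current_crawl_number
```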
[0061] If the current crawl is a first full crawl, the decision step 1003 passes control to step 1006, where the document
associated with the URL is unconditionally retrieved and the first access time 422 is set equal to the current time in
step 1007. In other words, documents identified in the transaction log 310 are unconditionally retrieved during the first

[...]

[0064] In one actual embodiment of the invention, [...] using the HTTP protocol, an HTTP "Get If-Modified-Since"
command [...] to the Web server addressed by the URL. This command includes a specification of a time stamp [...]
is made when other protocols are used to [...]

[0066] [...] was retrieved at step 1008. Some Web servers do not support the HTTP "Get If-Modified-Since" com-
mand. Therefore, receiving a new document at step 1008 and [...] document is retrieved at step 1010 does not
guarantee that the [...] has a more recent time stamp. [...] processing continues to step 1012 (FIGURE 10B) [...]

[0067] If the document was not retrieved [...] a determination is made whether the document still exists. If [...]
document are deleted from the index 210 and [...] step 1019 that determines if the time [...] step 1030. If a
determination cannot be [...] is not marked as complete. [...] the entry 510 for this URL is not marked as complete,
the worker thread may attempt to retrieve the URL again later, [...] number. After this predetermined [...] the history
map 308 and is compared with the new hash [...] the filtered data corresponding to the newly retrieved document [...]
in the history map 308 is [...] crawl number modified 420 [...]
value 416, the document time stamp 414, and the crawl number modified 420 that was set at step 1026. While not
required, data from the document may be stored along with the newly computed hash value and document time stamp
even if the hash values are equal.

[0074] At step 1028, the URLs that are included as hyperlinks in the newly retrieved document are processed. The
processing of the linked URLs at the step 1028 is illustrated in FIGURE 11 and discussed below. At step 1030, the
status 514 for the entry 510 being processed is marked as processed. Besides being used in step 912 to determine if
all the entries 510 have been processed, marking the entries 510 as they are completed assists in a recovery from a
system failure by allowing the crawler to continue the crawl from where it left off. After step 1030, the processing of
the URL is finished.

[0075] FIGURE 11 illustrates the processing of the linked URLs contained within a document. At step 1102, a linked
URL is retrieved from the filtered data passed from the filter daemon 314. At step 1104, a determination is made whether
the history map 308 contains the linked URL. If the history map does not contain the linked URL, at step 1106, the
linked URL is added to the history map 308 and the entry 410 is initialized as discussed above. The linked URL is also
added to the transaction log 310 at step 1108, and processing continues to decision step 1114.

[0076] If, at step 1104, it is determined that the history map 308 contains the linked URL, processing continues to
step 1110, where a determination is made whether the crawl number crawled in the history map 308 associated with
that URL is set to the current crawl number. A negative determination indicates that the linked URL has not yet been
processed during the current crawl and the crawl number crawled is set to the current crawl number in step 1112, and
the URL is added to the transaction log 310 in step 1108. If the crawl number crawled 418 is equal to the current crawl
number, the URL has already been added to the transaction log 310, the step 1108 is skipped, and the processing
proceeds to step 1114.

[0077] At decision step 1114, a determination is made whether there are any additional linked URLs in the filtered 
data. If any additional linked URLs exist, processing returns to step 1102, to process the next linked URL. If, at step 
1114, there are no more linked URLs to process, the processing of the linked URLs is complete. 
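
The loop of FIGURE 11 (steps 1102 through 1114) can be sketched as shown here. This is a minimal sketch that follows the text literally; the function name, the dictionary-based history map, and the list-based transaction log are assumptions made for illustration, and a new entry carries only the one field the loop needs.

```python
def process_linked_urls(linked_urls, history_map, transaction_log, current_crawl):
    """Queue linked URLs that have not yet been handled in the current crawl."""
    for url in linked_urls:                    # steps 1102 and 1114
        entry = history_map.get(url)           # step 1104
        if entry is None:
            # Step 1106: add a new entry, initialized to "empty" values
            # (other fields of entry 410 are omitted in this sketch).
            history_map[url] = {"crawl_number_crawled": 0}
            transaction_log.append(url)        # step 1108
        elif entry["crawl_number_crawled"] != current_crawl:  # step 1110
            entry["crawl_number_crawled"] = current_crawl     # step 1112
            transaction_log.append(url)        # step 1108
        # Otherwise the URL is already in the log and step 1108 is skipped.
```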

[0078] FIGURES 12A-C illustrate a process performed during an adaptive incremental crawl for initializing base
probability and rate distributions. Those statistical distributions may be used as a starting point by the statistical model
to determine if a document should be accessed. A probability distribution (base, document, training, or active) estimates
a continuous probability function that a document has changed at a given change rate. Because of the constraints of
current computer processing capabilities, the statistical model estimates the continuous probability function by tracking
a plurality of probabilities at sample rates. The greater the "resolution", or number of probabilities at sample rates
tracked, the better the estimate of the actual function. In an actual embodiment of the invention, the resolution is twenty
sample points, or probabilities. This resolution is believed to provide a reasonable balance between accuracy and
speed of computation. Of course, as the speed of computers increases, the resolution may be advantageously increased.

[0079] Turning to FIGURE 12A, at step 1210, a base probability distribution is initialized so that each probability in
the distribution contains an estimated initial probability that one document will change with a certain change rate. These
estimated initial probabilities need not be very accurate initially, since the method described below will improve the
accuracy through training. However, more accurate initial probabilities may be preferable.

[0080] A method of an actual embodiment of the invention for estimating a set of starting values for the base
probability distribution is illustrated in FIGURE 12B. It has been estimated that approximately 30% of the documents on
the Web will change at varying rates over many Web crawls, while the remaining approximately 70% of the documents
will remain relatively static during that interval. Since the probability distribution will contain a set of probabilities P1 to
Pn that sum to 1 (or in percentages: 100%) regardless of the resolution, 30% of the 100% is distributed evenly over
P1 to P(n-1) such that each of P1 to P(n-1) equals 0.3/(n-1). The remaining 70% of the 100% of probabilities is assigned
to the last probability (Pn = 0.7) in the distribution.
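
The starting values just described can be computed with a few lines of code. The sketch below assumes the 30%/70% split and the resolution of twenty sample points mentioned in the description; the function name and parameter names are hypothetical.

```python
def base_probability_distribution(n=20, changing_share=0.3):
    """Return [P1, ..., Pn]: changing_share spread evenly over P1..P(n-1),
    with the remainder assigned to Pn."""
    if n < 2:
        raise ValueError("resolution must be at least 2")
    probabilities = [changing_share / (n - 1)] * (n - 1)
    probabilities.append(1.0 - changing_share)
    return probabilities
```

With the defaults, each of P1 to P19 is 0.3/19 and P20 is 0.7, so the distribution sums to 1 up to floating-point rounding.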

[0081] Expressed in this way, the base probability distribution, and all probability distributions that descend from it,
represent the probability that the document will change at a given rate, over a plurality of sample rates. It will be apparent
to one skilled in the art that there are many ways to estimate and express initial base probability distributions while
remaining within the spirit and scope of the present invention. For instance, the initial probability rates may be set to
anything from normalized random numbers to actual probability rates determined experimentally over time.

[0082] Returning to FIGURE 12A, a base rate distribution is provided for the statistical computations regarding the
document. The base rate distribution reflects the selection of the sample points at which the continuous probability
function will be estimated. At step 1212, the base rate distribution is initialized. One computation for initializing the
base rate distribution is illustrated in FIGURE 12C. In the base rate distribution, a plurality of change rates are chosen
and expressed in an actual embodiment of the invention as number of changes per second. Each change rate has a
corresponding probability in the base probability distribution (i.e., the base distributions have the same resolution). In
an actual embodiment of the invention, the first rate R1 to rate R(N-1) are chosen at evenly spaced change rates
between a Low change rate and a High change rate using the formula:
Formula 1:

36O0^High + \{n - O^ ^J^ ) 
[0088] At step 1312, the next entry 410 is [...]
[0089] At decision step 1316, if the response from the [...] transaction log 310. The process then returns
to decision step 1310, where the [...] URL 412 to the transaction log 310.
[0090] When every document in the history map 308 has been processed, [...] to step 1320, where the training
probability [...] distribution for the next crawl, [...] be accessed based [...] since the last time the document was
accessed. In other words, [...] document may have [...] change rate. After the document probability distribution for
the document is calculated, the process continues to step 1422.

[0094] At step 1422, a weighted sum of the probabilities in the document probability distribution is taken according
to the Poisson model, with DT equal to the time since the last access of the document (i.e., DPD[1] * e^(-R[1]*DT) +
DPD[2] * e^(-R[2]*DT) + ... + DPD[n] * e^(-R[n]*DT)). The weighted sum thus computed is the probability that the
document has not changed (PNC). The probability that the document has changed (PC) is the complement of PNC
(PC = 1 - PNC).
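
The weighted sum at step 1422 can be illustrated with a short function. This is a sketch only: `dpd` stands for the document probability distribution DPD[1..n], `rates` for the sample rates R[1..n] in changes per second, and `dt` for the time in seconds since the last access; the function name is hypothetical.

```python
import math


def probability_changed(dpd, rates, dt):
    """Return PC = 1 - PNC, where PNC = sum of DPD[i] * e^(-R[i] * DT)."""
    if len(dpd) != len(rates):
        raise ValueError("dpd and rates must have the same resolution")
    pnc = sum(p * math.exp(-r * dt) for p, r in zip(dpd, rates))
    return 1.0 - pnc
```

For example, with all of the probability mass on a rate of zero, PNC is 1 and PC is 0 for any elapsed time, which matches the intuition that a document that never changes need not be revisited.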

[0095] At step 1424, a probability that the document will be accessed (PA) may be optionally computed and biased
by both a specified synchronization level (S) and the probability that the document has changed (PC). In other words,
this embodiment of the invention optionally allows the ultimate decision whether to retrieve a document to be biased
by a synchronization level, specified by a system administrator. By adjusting the synchronization level for different
crawls, a system administrator may bias the likelihood of retrieving documents in accordance with the administrator's
tolerance for having unsynchronized documents. Thus, using the formula PA = 1 - ((1-S)/PC), where S is the desired
synchronization level and PC is the probability that the document has changed as calculated in step 1422, a probability
(PA) that the document should be accessed is calculated.

[0096] At step 1426, a coin flip is generated with a "heads" bias equal to the probability of access (PA) computed in
step 1424. A decision is made to either "access" or "not access" the document based on the result of this coin flip. The
coin flip is provided because it may be desirable to add a random component to the retrieval of documents in order to
strike a balance between the conservation of resources and ensuring document synchronization. The bias PA calculated
at step 1424 is applied to the coin flip to influence the outcome in favor of the likelihood that the document has
changed, modified by the desired synchronization level. The outcome of the coin flip is passed to decision step 1430.
[0097] At decision step 1430, if the outcome of the coin flip is "heads", the instruction to "access document" is returned
at step 1412. Otherwise, the instruction "don't access document" is returned at step 1432. Following steps 1412 or
1432, the process of FIGURES 14A and 14B is done.
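
Steps 1424 through 1430 can be sketched as a biased coin flip. The formula PA = 1 - ((1-S)/PC) is taken from the description; clamping PA to the range 0 to 1 and returning 0 when PC is not positive are assumptions added here, because the formula yields a negative value whenever PC is smaller than 1 - S and the description does not say how that case is handled. The function names are hypothetical.

```python
import random


def access_probability(pc, s):
    """Return PA = 1 - ((1 - S) / PC), clamped to [0, 1] (clamp is an assumption)."""
    if pc <= 0.0:
        return 0.0  # assumption: no evidence of change, so never access
    pa = 1.0 - (1.0 - s) / pc
    return min(1.0, max(0.0, pa))


def should_access(pc, s, rng=random):
    """Flip a coin with "heads" bias PA; True means "access document"."""
    return rng.random() < access_probability(pc, s)
```

For instance, with PC = 0.5 and S = 0.75 the bias is PA = 0.5, so the document is accessed on roughly half of the crawls. Passing a seeded `random.Random` instance as `rng` makes the decision reproducible for testing.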

[0098] FIGURE 15 illustrates a process performed to calculate a document probability distribution. The process
begins at step 1510 by making a copy of the active probability distribution as a new instance of a document probability
distribution. At step 1516, the document probability distribution is trained using a statistical model that reflects the
change rate patterns of the document as experienced during previous Web crawls. The training of the document
probability distribution is illustrated in detail in FIGURES 16A1-2 and described below. Briefly described, the document
probability distribution is trained for "change," "no change," and "no change chunk" event intervals using a discrete
random variable distribution. Once the document probability distribution has been trained, the process continues to
step 1518, where the document probability distribution is added to the training probability distribution as illustrated in
more detail in FIGURE 17. The document probability distribution is returned to step 1418 of FIGURE 14A in step 1520,
and the process illustrated in FIGURE 15 is finished.

[0099] FIGURES 16A1-2 illustrate a process for training the document probability distribution. At step 1610, the
accesses 428 to a document are mapped to a timeline. One example of such a timeline is illustrated in FIGURE 16B
and described below. Briefly described, the history map 308 contains the first access time 422, the last access time
424, the change count 426, and the access count 428 for each document identified in the history map 308. The timeline
begins at the first access time 422 and ends at the last access time 424. The timeline is then divided into a number of
uniform intervals equal to the number of accesses in the access count 428. The process then continues to step 1612.
[0100] At step 1612, the process assumes that the amount of time between each change (identified by the change 
count 426) is uniform. Thus, the changes are evenly distributed on the timeline. The information necessary for the 
application of the Poisson process can be derived from the mapping of the changes to the timeline. The process 
continues from step 1612 to step 1614. 

[0101] At step 1614, several variables are calculated from the historical information in each entry 410 for use in the
training of the document probability distribution. The average time between accesses (intervals) is computed and stored
as the interval time (DT). The number of intervals between changes is calculated (NC). The number of intervals in
which a change occurred is calculated (C). A group of intervals between changes is termed a "no change chunk."
Accordingly, the number of no change chunks (NCC) is calculated. And, finally, the length of time of each no change
chunk (DTC) is calculated.

[0102] An event probability distribution for a no change event is computed in a step 1630. The event probability
distribution includes a plurality of probabilities (EP[N]) that the event will occur at a given change rate (N) for the interval
(DT) experienced with the no change events. Each probability EP[N] is computed using the Poisson process: EP[N] =
e^(-R[N] * DT), where e is the transcendental constant used as the base for natural logarithms, R[N] is the rate of
change and DT is the time interval of the event. At step 1632, the event probability distribution EP[N] calculated at step
1630 is passed to a process for training the document probability distribution for the no change events. The operations
performed by the process to train the document probability distribution for each no change event are illustrated in detail
in FIGURE 16C1-2 and described below.
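
The event probability distribution of step 1630 can be illustrated as follows. The first function implements EP[N] = e^(-R[N] * DT) as stated above; the second takes the complement of each entry, which is how the description characterizes the distribution for change events. The function names are hypothetical, and `rates` and `dt` stand for R[1..n] and DT.

```python
import math


def no_change_event_probabilities(rates, dt):
    """EP[N] = e^(-R[N] * DT) for each sample rate (step 1630)."""
    return [math.exp(-r * dt) for r in rates]


def change_event_probabilities(rates, dt):
    """Complement of each no change probability: 1 - e^(-R[N] * DT)."""
    return [1.0 - p for p in no_change_event_probabilities(rates, dt)]
```

A rate of zero gives a no change probability of exactly 1 for any interval, and higher rates or longer intervals drive the no change probability toward 0.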
[0103] [...] distribution includes a plurality of probabilities [...] (DT) experienced with the change events. [...]
computed using the Poisson process: EP[N] = 1 - e^(-R[N] * DT). Alternatively, the event probability distribution may
be calculated by taking the complement of each probability in the event probability distribution calculated for the no
change events (calculated in step 1630). At step 1634, the event probability distribution EP[N] [...] to a process for
training the document probability distribution for the change events. [...] performed by the process to train the
document probability distribution are illustrated in FIGURE 16C1-2 and described below.

[0104] At a step 1635, an event probability [...] distribution includes a plurality of probabilities [...] event will occur
at a given change rate (N) for the interval (DTC) interpolated for the no change chunk events. Each probability EP[N]
is computed using [...] EP[N] calculated at step [...] the no change chunk events, as [...] the document probability
distribution [...] events/intervals are trained in steps 1632, 1634, and 1638 is believed to be [...] ability distribution is
trained for each no change chunk interval [...] probability distribution is completely trained, the process of FIGURE
16A is done at step [...]. Those skilled in the art will appreciate that alternative statistical models may be employed
to train the document probability distribution without deviating from the spirit of the invention.
models may be employed to train the documen P« ^Xnstructed in accordance with the process of 
[0106] FIGURE 16B is a graphical representation of [...] constructed in accordance with the process of
FIGURE 16A. Each pair of adjacent accesses 1618 defines [...] 1620. The time of the first access, as stored in
the history map 308, [...] access as stored in the history map 308. [...] in time between the last access time [...]
on the timeline 1616.

[0107] In general, an interval 1620 that does not contain a change event 1619 is considered to contain a no change
[...] calculated by the Poisson equation [...] no change chunk 1628 is a group of no change [...] which cannot be
evenly placed into a no change chunk [...] the present invention is not limited to the [...]

[0108] FIGURES 16C1-2 illustrate one [...] training the document probability distribution for occurrence of an event
for each passed event type (e.g., no change event, change event and no change chunk event). Beginning with step
1650, each occurrence of [...] (NCC) is trained. At step 1652, the probability of the event occurring is computed by
summing [...] each probability in the document probability [...] probability that the event has occurred (given a [...]
event occurring, i.e., DPD[N] = (DPD[N] [...] from step 1658 is checked in a decision step 1660 for an [...]

[0112] FIGURE 17 illustrates the update of the training probability [...] from the active probability distribution and [...]
zero. After each document probability distribution is calculated, each probability (Pn) in the document probability
distribution is added to a corresponding probability (Pn) in the training probability distribution 1710. In this way, the training
probability distribution aggregates the experience with all the document probability distributions calculated for the
adaptive incremental crawl. The training probability distribution 1710 becomes the active probability distribution for the
next crawl once it is normalized 1714 to sum to one.
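
The aggregation and normalization illustrated in FIGURE 17 can be sketched as two small functions. The sketch normalizes the accumulated values so that they sum to 1, consistent with the statement in paragraph [0080] that a probability distribution sums to 1. The function names are hypothetical, and distributions are modeled as plain lists of equal length.

```python
def add_to_training(training, document_distribution):
    """Add each document probability to the matching training probability (in place)."""
    if len(training) != len(document_distribution):
        raise ValueError("distributions must have the same resolution")
    for i, p in enumerate(document_distribution):
        training[i] += p


def normalize(distribution):
    """Scale a distribution so that its probabilities sum to 1."""
    total = sum(distribution)
    if total == 0:
        raise ValueError("cannot normalize an all-zero distribution")
    return [p / total for p in distribution]
```

As a usage illustration, starting from a training distribution of `[0.0, 0.0]` and adding the document distributions `[0.25, 0.75]` and `[0.75, 0.25]` yields `[1.0, 1.0]`, which normalizes to `[0.5, 0.5]` for use as the next active distribution.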

[0113] FIGURE 18 illustrates an exemplary process 1802 for handling a Web search request in accordance with the
present invention. At a step 1804, a search engine 230 (FIGURE 2) receives a search request from a client application
such as the Web browser 215. If the user wishes to receive only those documents that have changed in some
substantive way since the last time the search request was run, the Web browser 215 (or other server or client application)
sending the search request implicitly adds a clause to the search request that limits the search to only return those
documents that have a crawl number modified that is greater than a stored crawl number associated with the last time
the search request was processed by the search engine 230 (step 1805). The stored crawl number is retained in a
search request history 250 (FIGURE 2) and represents the crawl number of the most recent crawl that preceded the
last time that the search request was processed.

[0114] At step 1806, the search engine 230 searches the index 210 for entries matching the specified criteria. The
search engine 230 returns to the client computer 214 search results that include zero, one, or more "hits" at a step
1808. Each hit corresponds to a document that matches the search criteria. A "match" includes having a crawl number
modified that is more recent than the stored crawl number specified in the search request. After the search is performed,
at step 1810, the client application 215 implicitly asks the search engine 230 to return the crawl number of the most
recently performed crawl, which it then stores with the search request in a search request history.
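
The added clause that restricts results to documents changed since the last search can be sketched as a simple filter. This is an illustration under stated assumptions: each hit is modeled as a dictionary with a hypothetical key `crawl_number_modified` holding the crawl number modified 420, and the function name is also hypothetical.

```python
def changed_since_last_search(hits, stored_crawl_number):
    """Keep only hits whose crawl number modified exceeds the stored crawl number."""
    return [
        hit for hit in hits
        if hit["crawl_number_modified"] > stored_crawl_number
    ]
```

After returning the filtered hits, a client would store the number of the most recently performed crawl with the search request, so that the next run of the same request only reports documents modified in later crawls.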
[0115] While the preferred embodiment of the invention has been illustrated and described, it will be appreciated 
that various changes can be made therein without departing from the spirit and scope of the invention as defined by 
the appended claims. 



Claims 

1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer,
the document being identified by a document address specification, the document having been retrieved during a
previous crawl, the method comprising:

determining whether to access the document during the current crawl with the aid of a statistical model; and
accessing the document if the determination produces an instruction indicative that the document at the
document address specification should be accessed during the current crawl.

2. The method of Claim 1, wherein determining whether to access the document further comprises computing a
probability that the document has changed since the document was retrieved during the previous crawl.

3. The method of Claim 2, wherein computing the probability that the document has changed further comprises:

selecting an active probability indicative of a proportion of documents in a plurality of documents that are
changing at various change rates, the plurality of documents including the document;

training the active probability to reflect an experience with the document during a plurality of previous crawls;
and

using the trained active probability to compute the probability that the document has changed.

4. The method of Claim 3, further comprising: 

selecting the probability that the document has changed from the previous crawl as the active probability in 
the current crawl; and 

repeating the method of Claim 3 for the current crawl. 

5. The method of Claim 3, wherein training the active probability includes multiplying the active probability indicative 
of a change in the document by a training probability calculated using a statistical model. 

6. The method of Claim 1, wherein the statistical model further comprises:

training a document probability distribution corresponding to the document address specification to reflect an
experience with the document during a plurality of previous crawls, the document probability distribution
including a plurality of probabilities; [...] distribution a probability that the document has changed; and
[...] the document has changed.

7. The method of Claim 6, further comprising:

[...] from the discrete random variable distribution.

10. A computer-readable medium having computer-executable instructions for [...] one document in a plurality
of documents from a remote server, [...] information associated with the changes to the one document at the
remote server.

11. The computer-readable medium of Claim 10, further comprising:

if the determination to access the one document is positive, identifying the one document for retrieval during
[...] during the crawl procedure.

12. The computer-readable medium of Claim 10, wherein determining whether [...] comprises computing a
probability that the one document has changed since the one document was last retrieved from the remote server.

13. The computer-readable medium of Claim [...], wherein computing the [...] the one document has [...]
the one document [...] the document is likely to have changed.

[...] The computer-readable medium of Claim 14, wherein the random decision is made by a software routine adapted
to simulate a flip of a coin.

17. The computer-readable medium of Claim 10, wherein:

[...] document was last retrieved from the remote server; and

wherein the analysis includes a comparison of the time stamp included in the historical information with another
time stamp associated with the one document stored on the remote server.

18. The computer-readable medium of Claim 17, further comprising:

if the time stamp included in the historical information does not match the other time stamp associated with
the one document stored on the remote server, identifying the one document for retrieval during the crawl
procedure.

19. The computer-readable medium of Claim 10, wherein:

the historical information associated with changes to the one document includes a hash value associated with
the one document, the hash value being a representation of the one document; and

wherein the analysis includes a comparison of the hash value included in the historical information with another
hash value calculated from information retrieved from the one document stored on the remote server.

20. The computer-readable medium of Claim 19, if the hash value included in the historical information does not match
the other hash value associated with the one document stored on the remote server, identifying the one document
for retrieval during the crawl procedure.